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ABSTRACT 

Load balancing is a widely accepted technique for perfor¬ 
mance optimization of scientific applications on parallel ar¬ 
chitectures. Indeed, balanced applications do not waste pro¬ 
cessor cycles on waiting at points of synchronization and 
data exchange, maximizing this way the utilization of pro¬ 
cessors. In this paper, we challenge the universality of the 
load-balancing approach to optimization of the performance 
of parallel applications. First, we formulate conditions that 
should be satisfied by the performance profile of an appli¬ 
cation in order for the application to achieve its best per¬ 
formance via load balancing. Then we use a real-life scien¬ 
tific application, MPDATA, to demonstrate that its perfor¬ 
mance profile on a modern parallel architecture, Intel Xeon 
Phi, significantly deviates from these conditions. Based on 
this observation, we propose a method of performance opti¬ 
mization of scientific applications through load imbalancing. 
We also propose an algorithm that finds the optimal, possi¬ 
bly imbalanced, configuration of a data parallel application 
on a set of homogeneous processors. This algorithm uses 
functional performance models of the application to find the 
partitioning that minimizes its computation time but not 
necessarily balances the load of the processors. We show 
how to apply this algorithm to optimization of MPDATA 
on Intel Xeon Phi. Experimental results demonstrate that 
the performance of this carefully optimized load-balanced 
application can be further improved by 15% using the pro¬ 
posed load-imbalancing optimization. 
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1. INTRODUCTION 

Load balancing is a widely accepted technique for optimiza¬ 
tion of the computation performance of scientific applica¬ 
tions on parallel architectures. Indeed, the intuition sug¬ 
gests that unlike unbalanced applications, the balanced ones 


do not waste processor cycles on waiting at points of syn¬ 
chronization and data exchange, maximizing this way the 
utilization of the processors. 

In this paper, we challenge the universality of the load¬ 
balancing approach to optimization of the computation per¬ 
formance of parallel applications. First, we try to under¬ 
stand the limitations of the load-balancing approach. We 
formulate conditions that should be satisfied by the perfor¬ 
mance profile of an application in order for the application 
to achieve its best performance via load balancing. 

Then we use a real-life scientific application to demonstrate 
that its performance profile on a modern parallel architec¬ 
ture does not satisfy these conditions. The application we 
use implements the Multidimensional Positive Definite Ad- 
vection Transport Algorithm (MPDATA), which is one of 
the major parts of the dynamic core of the EULAG geophys¬ 
ical model |2l]. EULAG (Eulerian/semi-Lagrangian fluid 
solver) is an established numerical model developed for sim¬ 
ulating thermo-fluid flows across a wide range of scales and 
physical scenarios [19, 22]. In particular, it can be used 
in numerical weather prediction (NWP), simulation of ur¬ 
ban flows, areas of turbulence, ocean currents, etc. This 
solver, originally developed for conventional HPC systems, 
is currently being re-written for modern HPC platforms. In 
particular, MPDATA has been recently re-written and op¬ 
timized for execution on an Intel Xeon Phi coprocessor |25[ 

M- 

In our experiments, we observe significant deviations of the 
MPDATA performance profile from the conditions required 
for applicability of the load-balancing techniques. Based on 
this observation, we propose a general method of perfor¬ 
mance optimization of scientific applications through load 
imbalancing as well as an algorithm that finds the optimal, 
possibly imbalanced, configuration of a data parallel appli¬ 
cation on a set of homogeneous processors. This algorithm 
uses functional performance models of the application M] 
|~L3] to find the partitioning that minimizes its computation 
time but not necessarily balances the load of the proces¬ 
sors. Finally, we apply this algorithm to optimization of 
MPDATA on Intel Xeon Phi. Experimental results demon¬ 
strate that the performance of this carefully optimized load- 
balanced application can be further improved by 15% using 
the proposed load-imbalancing method. 


The contributions of the work presented in the paper are as 
follows: 


• Formulation of the conditions that should be satis¬ 
fied to guarantee that load balancing will minimize the 
computation time of parallel application. 

• Building the performance profile of a real-life scientific 
application on a modern HPC platform and demon¬ 
stration of its significant deviation from the conditions 
that guarantee that load balancing be a safe technique 
for performance optimization. 

• A new optimization method that uses the performance 
profile for optimization of the application through its 
imbalancing. 

• A partitioning algorithm finding the optimal and gen¬ 
erally speaking uneven distribution of computations of 
an application between homogeneous processing units 
using its functional performance model. 

• Application of the proposed partitioning algorithm to 
optimization of MPDATA on Intel Xeon Phi, resulting 
in further acceleration of this carefully optimized load- 
balanced application by up to 15%. 


The rest of the paper is structured as follows. Section [2] 
overviews load-balancing techniques and formulates the con¬ 
ditions when these techniques would minimize the compu¬ 
tation time of parallel applications. Section [3] analyzes the 
performance profile of MPDATA on Intel Xeon Phi and in¬ 
troduces the new approach to minimization of the compu¬ 
tation time through load imbalancing. Section [4] introduces 
a partitioning algorithm for (uneven) distribution of com¬ 
putations between homogeneous processors, minimizing the 
computation time of the application. Section [5] applies this 
algorithm to optimization of MPDATA on Xeon Phi. Sec- 
tion[6]presents experimental results, and Section[7]conchides 
the paper. 

2. LOAD BALANCING AND PERFORMANCE 

In this section we overview load-balancing techniques used 
for optimization of the performance of parallel scientific ap¬ 
plications on both homogeneous and heterogeneous plat¬ 
forms. We also formulate the conditions when application of 
these techniques will optimize the computation performance. 

Load balancing is a widely accepted and commonly used 
approach to performance optimization of scientific applica¬ 
tions on parallel architectures. Load balancing algorithms 
can be classified as static or dynamic. Static algorithms (for 
example, those based on data partitioning) [3| 27 [ IS, 113] 
require a priori information about the parallel application 
and platform. This information can be gathered either at 
compile-time or runtime. Static algorithms are also known 
as predicting-the-future because they rely on accurate per¬ 
formance models as input to predict the future execution 
of the application. Static algorithms are particularly use¬ 
ful for applications where data locality is important because 
they do not require data redistribution. However, these al¬ 
gorithms are unable to balance on non-dedicated platforms, 
where load changes with time. Dynamic algorithms (such 


as task scheduling and work stealing) [l5j [3| | 20 ] balance the 
load by moving fine-grained tasks between processors during 
the calculation. Dynamic algorithms do not require a priori 
information about execution but may incur significant com¬ 
munication overhead due to data migration. Dynamic algo¬ 
rithms often use static partitioning for their initial step due 
to its provably near-optimal communication cost, bounded 
tiny load imbalance, and lesser scheduling overhead [24] . 

Whatever load balancing algorithm is used, the goal is al¬ 
ways to minimize the computation time of the application. 
The intuition behind the assumption that balancing the ap¬ 
plication improves its performance is the following: a bal¬ 
anced application does not waste processor cycles on waiting 
at points of synchronization and data exchange, maximizing 
this way the utilization of the processors. Is this assumption 
always true? To answer this question, let us formulate the 
assumption in a mathematical form. Consider an applica¬ 
tion, the computation performance of which can be modeled 
by speed functions. Namely, let p parallel processors be used 
to execute the application, and let s*(x) be the speed of ex¬ 
ecution of the workload of size x by processor i. Here the 
speed can be measured in floating point operations per sec¬ 
ond or any other fix-sized computation units per unit time. 
The size of workload can be characterized by the problem 
size (say, the number of cells in the computational domain 
or the matrix size) or just by the number of equal-sized com¬ 
putational units. Anyway, the speed Si(x) is calculated as 
where U(x) is the execution time of the workload of 
size x on processor i. Using these definitions, we can formu¬ 
late the following theorem. 


Theorem 1 : Let si(x) > 0 (x > 0) be the speed of processor 
i £ {1,... ,p}, and VAx > 0: ^ Let Xl + 

.. ,+Xp = n> 0 and Sl ^ l) = ... = Sp ^ p) . Then, V 3 / 1 ,... ,y p 
> 0 such that ( 2 / 1 ,..., y p ) % (xi,...,x p ) and yi + ... + y p = 
n: max* y P , > X T , . 

SiiVi) ~ s 1 ( 11 ) 


Proof: As (yi,...,y P ) % (aci, —, x p ) and yi + ... + y p = 
xi + ... + x p , then there exists k G { 1 ,... ,p} such that 


Vk > %k- Therefore, max* U x > —7^ 

— s k (y k ) 

x k _ x 1 

— s k (x k ) — Sl(xi)* 


x k^~{Vk x k ) 

s k( x k + (Vk- x k)) 


Theorem 1 states that in order to guarantee that the bal¬ 
anced configuration of the application will execute the work¬ 
load of size n faster than any unbalanced configuration, the 
speed functions s$(x), characterizing the performance pro¬ 
files of the processors, should satisfy the condition: 


\/Ax > 0 : 


Si(x) Si(x + Ax) 
x ~ x + Ax 


(i) 


Geometrically, it can be illustrated as follows. If we plot a 
speed function as shown in Figure [l] then the angle a(x) 
between the straight line, connecting the point ( 0 , 0 ) and 
the point (x, s(x)) on the speed curve, and the x-axis will 
be inversely proportional to the execution time of the work¬ 
load of size x by the processor. Indeed, the cotangent of this 
angle is directly proportional to the ratio representing 
the execution time of the workload x. Therefore, larger an¬ 
gles correspond to shorter execution times. The condition^ 














means that the increase of the workload, x, will never result 
in the decrease of the execution time, or equivalently in the 
increase of the angle a(x). 


Speed 



Figure 1: Example of speed function suitable for 
minimization of computation time through load bal¬ 
ancing. Angle a(x) represents the computation time: 
the greater the angle, the shorter the computation 
time. 


The main body of the load balancing algorithms designed 
for performance optimization explicitly or implicitly assume 
that the speed of processor does not depend on the size 
of workload [8j [27, 16, [4j 10] 17 . In other words, the 
speed functions s7(i) are assumed to be positive constants, 
in which case the condition [l] is trivially satisfied. More 
advanced algorithms are based on functional performance 
models (FPMs), which represent the speed of processor by 
a continuous function of the problem size 14 11 . However, 
the shape of the function is not arbitrary but has to satisfy 
the following assumption jl3]: Along each of the problem 
size variables, either the function is monotonically decreas¬ 
ing, or there exists point x such that 


• On the interval [0, x], the function is 

— monotonically increasing, 

— concave, and 

— any straight line coming through the origin of 
the coordinate system intersects the graph of the 
function in no more than one point. 

• On the interval [x, oo), the function is monotonically 
decreasing. 


3. OPTIMIZATION OF PARALLEL APPLI¬ 
CATIONS THROUGH LOAD IMBALANC- 
ING 

In this section, we demonstrate that the performance profile 
of real-life scientific applications on modern parallel plat¬ 
forms may significantly deviate from the conditions, which 
guarantee that load balancing will always optimize their 
computation performance. Based on this observation, we 
propose an optimization method that uses the performance 
profile for optimization of the application through its imbal- 
ancing. 

In this work, we build the performance profile of MPDATA 
on Intel Xeon Phi. MPDATA is a core component of EULAG 
(Eulerian/semi-Lagrangian fluid solver), which is an estab¬ 
lished computational model developed for simulating thermo¬ 
fluid flows across a wide range of scales and physical scenar¬ 
ios. Its carefully optimized data-parallel implementation on 
a 60-core Intel Xeon Phi [26 partitions the 3D rectilinear 
nxnx l domain into four equal ^ X ^ x l sub-domains, each 
allocated to a team of 15 cores. This configuration of the 
application is the best load-balanced configuration identified 
in 26]. The experimentally constructed speed functions of 
these four teams, each processing (in parallel) a 120xmx 128 
sub-domain, are shown in Figure [2] 



Figure 2: Speed functions of Intel MIC built for 
four 15-core teams, each processing in parallel a 120 x 
m x 128 sub-domain. The speed is measured in cells 
per second, while the problem size is represented by 
m. 


These restrictions on the shape of speed functions guarantee 
that the efficient load balancing algorithms, proposed in 19, 
0 m e i will always return a unique solution, mini¬ 
mizing the computation time. At the same time, it is easy 
to show that the restrictions imposed on FPMs will make 
them comfortably satisfy the condition [l] 

Thus, the state-of-the-art load balancing algorithms designed 
for optimization of the computation performance of paral¬ 
lel applications assume that their performance profiles sat¬ 
isfy the condition^ Therefore, correct application of such 
algorithms requires that the experimental speed points be 
approximated by a function satisfying this condition. This 
approximation step may significantly distort the actual per¬ 
formance profile and lead to a substantially non-optimal so¬ 
lution. 


This graph clearly shows that for many m and Am the 
speed of processing of the 120 x m x 128 sub-domain will 
be significantly lower than the speed of processing of the 
120 x (m + Am) X 128 sub-domain. Moreover, we can also 
see that a(m + Am) > a(m) for some such m and Am, 
which means that the time of processing of the 120 x m x 128 
sub-domain will be longer than the time of processing of the 
120 x (m + Am) x 128 sub-domain. The latter observation 
can be used to speed up the execution of the application. 
Namely, if we re-partition the equally partitioned domain so 
that two teams get 120 x (m + Am) X 128 sub-domains and 
two other teams get 120 x (m —Am) X 128 sub-domains, and 
min {a(m + Am),o(m — Am)} > <a(m), than this unequal 
and unbalanced partitioning will result in faster execution. 

In general, if the performance profile of an application vio- 

















lates the condition [I] that is, 


3i G ,p}3x > 03Ax > 0: 


Si(x) Si(x + Ax) 

x x + Ax 


( 2 ) 


and the balanced configuration of the application allocates 
the workload of size x to processor z, then the application 
can be accelerated if we reduce the accumulated workload 
of all processors but processor z by Ax so that none of these 
processors would increase its execution time, and allocate 
this additional workload to processor z. This method can 
be applied to optimization of parallel applications on both 
heterogeneous and homogeneous platforms. 


4. MODEL-BASED PARTITIONING ALGO¬ 
RITHM FOR OPTIMAL LOAD IMBAL- 
ANCING 

In this section, we develop the proposed approach for a rel¬ 
atively simple case and introduce a partitioning algorithm 
that finds the optimal distribution of computations of an 
application between homogeneous processors using the func¬ 
tional performance model of the application. 

Consider the following problem. Let p identical parallel pro¬ 
cessors be used to execute the workload of size n, and let 
s(x) be the speed of execution of the workload of size x by 
a processor. Let Ax be the minimal granularity of workload 
so that each processor can be only allocated a multiple of 
Ax. The problem is to find the distribution of the work¬ 
load of size n between the p processors, which minimizes 
the computation time of its parallel execution. 

For simplicity, we assume that ^ be a multiple of Ax and 
p be an even number. Then the following Algorithm [l] will 
solve this problem. 


Algorithm 1 Optimal distribution of workload between ho¬ 
mogeneous processors 

Xropt — Xl 0 pt — X r — Xl — — 

n 

t ■ — P 

^miri — s(-) 

repeat 

x r T — Ax 
xi~ — Ax 

t — max( f r t , ? l ) 
if t < tmin then 

tmin — t 
Xropt — Xr 

Xlopt — Xl 

end if 
until x r < n 


This algorithm returns the optimal distribution of the work¬ 
load of size n between p processors so that each odd proces¬ 
sor receives the workload of size xz op t, and each even proces¬ 
sor receives the workload of size x r0 pt • It can be proved that 
the solution returned by this algorithm will always minimize 
the execution time of the given workload n. Note that the 
traditional load-balanced approach would assign the work¬ 
load of size ^ to all processors. 

It is obvious that if we replace the speed function s{x) by 
any function a X s(x), where a — const , then this algorithm 


will return the same solution. We will use this property 
when applying Algorithm [l] in Section [5~4] 

Algorithm [l] can be easily generalized for an arbitrary (not 
only even) number of processors. 


5. APPLICATION 

In this section, we apply the partitioning algorithm proposed 
in Section [4] to optimization of MPDATA on Intel Xeon Phi. 

5.1 Intel MIC overview 

The Intel MIC architecture is a relatively new system for 
high performance computing |1 . Intel MIC combines many 
integrated Intel CPU cores into a single chip. This ar¬ 
chitecture is built to provide a general-purpose program¬ 
ming environment similar to that provided for Intel CPUs. 
It is capable of running applications written in industry- 
standard programming languages such as Fortran, C, and 
C++. The Intel Xeon Phi (codenamed Knights Corner) is 
the first product based on Intel MIC architecture. This co¬ 
processor is delivered in form factor of a PCI express device, 
and can not be used as a stand-alone processor. However, 
it allows users to directly run individual applications in the 
native mode without the support of CPU. 

In this study, we use the top-of-the-line Intel Xeon Phi 7120P 
coprocessor. It contains 61 cores clocked at 1.238 GHz, and 
16 GB of on-board memory. As the Intel MIC architecture 
supports four-way hyper-threading, it totally gives 244 log¬ 
ical cores (threads) for a single chip. This coprocessor pro¬ 
vides 352 GB/s of memory bandwidth. An important com¬ 
ponent of each Intel Xeon Phi processing core is its vector 
processing unit (VPU) [26 , that significantly increases the 
computing power. Each VPU supports a new 512-bit SIMD 
instruction set called Intel Initial Many Core Instructions. 
The theoretical peak performance of Intel Xeon Phi 7120P 
is 1208 GFlop/s for double precision numbers. 


5.2 Introduction to MPDATA 

The MPDATA algorithm is a general approach for integrat¬ 
ing the conservation laws of geophysical fluids on micro-to- 
planetary scales [23]. It belongs to the class of methods for 
the numerical simulation of fluid flows which are based on 
the sign-preserving properties of upstream differencing. The 
MPDATA scheme allows for solving advection problems, and 
offers several options to model a wide range of complex geo¬ 
physical flows. 


MPDATA corresponds to the group of nonoscillatory forward- 
in-time algorithms. The number of required time steps de¬ 
pends on a type of simulated physical phenomenon, and can 
exceed few millions especially when considering the MPDATA 
algorithm as a part of the EULAG model. For detailed de¬ 
scription of the MPDATA mathematical scheme, the reader 
is referred to 21 22 23]. 


Each MPDATA time step is determined by a set of 17 com¬ 
putational stages, where each stage is responsible for calcu¬ 
lating elements of a certain matrix. These stages represent 
stencil codes which update grid elements according to dif¬ 
ferent patterns. Listing [l] shows a part of the 3D MPDATA 
stencil-based implementation for the 8-th stage. 











Listing 1: Part of 3D MPDATA stencil-based imple¬ 
mentation 

/*...*/ 

//stage 8 

for( ... ) // i — dimension 

for ( ... ) // j — dimension 

for( ... ) // k — dimension 

mx[ i , j ,k]=max(x [ i ] [ j ] [k] , 

X [i-l][j][k], x[i+l][j ] [k] , 
x[i][j-l][k], x[i][j+l][k] , 

X [ i ] [ j ] [k-1] , x [ i ] [j ] [ k+ 1]); 

/*...*/ 


The stages are dependent on each other: outcomes of prior 
stages are usually input data for the subsequent computa¬ 
tions. Every stage reads a required set of matrices from the 
main memory, and writes results to the main memory af¬ 
ter computation. In consequence, a significant traffic to the 
main memory is generated, which mostly limits the perfor¬ 
mance of novel architectures. A single MPDATA time step 
requires 5 input matrices, and returns one output matrix 
that is necessary for the next step. 


5.3 Adaptation of 3D MPDATA to Intel Xeon 
Phi coprocessor 

In our previous work [25, 26 , we proposed the adaptation 
of 3D MPDATA to Intel Xeon Phi coprocessors. The pro¬ 
posed decomposition contributes to ease the memory and 
communication bounds, and to better exploit computation 
resources of Intel Xeon Phi. The resulting adaptation is 
based on the following methodology: 


• (3+l)D decomposition of MPDATA heterogeneous sten¬ 
cil computations; 

• partitioning of threads into independent work teams; 

• parallelization of MPDATA computations; 

• scheduling for multicore and many core systems. 


To workaround the memory-bound nature of MPDATA, we 
proposed the (3+l)D decomposition of MPDATA stencil 
computation [26]. The main aim of this decomposition is 
to take advantage of cache memory reuse by transferring 
the data traffic associated with all intermediate computation 
from the main memory to the cache hierarchy. This aim is 
achieved by using a combination of two well-known loop op¬ 
timization techniques: loop tiling and loop fusion. Such an 
approach allows us to reduce the main memory traffic at the 
cost of additional computations associated with extra areas 
(halo) of all intermediate matrices. Another advantage of 
this approach is the possibility of reducing the main mem¬ 
ory consumption because all intermediate results are stored 
only in the cache memory. 

The proposed decomposition contributes to the data traffic 
from the main memory to the cache hierarchy. In conse¬ 
quence, a lot of inter- and intra-cache communications are 
generated between more than 200 Intel MIC’s processing 
cores. To improve the efficiency of the (3+l)D decomposi¬ 
tion on Intel Xeon Phi, we provided partitioning of available 


cores (threads) into independent work teams. As a result, 
the MPDATA computing domain is partitioned into P sub- 
domains of different sizes, each of which is processed by a sin¬ 
gle work team of threads, according to the proposed (3+l)D 
decomposition. The number of cores (threads) assigned to 
different teams can also be different. Figure [3] shows an ex¬ 
ample of partitioning of 60 Intel MIC’s processing cores into 
4 teams, and partitioning of the MPDATA grid into 4 pieces 
as well. 



50 % 50 % 


Figure 3: Partitioning of Intel MIC’s processing 
cores into 4 work teams 


Within every time step, the work teams execute compu¬ 
tations in parallel and independently of each other. After 
each time step, the work teams are synchronized. Each sub- 
domain is further partitioned into a number of MPDATA 
blocks, where subsequent blocks are processed one by one, 
and each block is processed in parallel by the correspond¬ 
ing work team. Every block is further decomposed into sub¬ 
blocks, where each sub-block is processed by a certain thread 
of the work team. A sequence of all the MPDATA stages 
is executed within every sub-block, taking into account the 
data dependencies. This is achieved at the cost of some 
extra computations performed for halo regions by all teams. 

Our best performance results on a single Intel Xeon Phi 
have been achieved so far by partitioning the 3D n x m x l 
MPDATA domain in two dimensions n and m into four equal 
sub-domains, so that there is one-to-one mapping between 
these sub-domains and the teams of cores arranged in a 2 x 2 
grid as illustrated in Figure [3] This partitioning allowed us 
to balance the load of the core teams and minimize the ex¬ 
ecution time of the application in comparison with all other 
partitioning shapes. 

5.4 Applying model-based partitioning algo¬ 
rithm to MPDATA decomposition 

Thus, the best load-balanced configuration of the MPDATA 
application on a Intel Xeon Phi arranges its cores in four 
15-core teams as shown in Figure [3] and evenly partitions 
the n x m X l computation domain between these teams, 
allocating a ^ x ^ x l sub-domain to each of the teams. 
In this subsection, we use the data-partitioning algorithm, 
presented in Section [4] to find a better partitioning of the 
computation domain between these teams of cores. 

As a first step, we build speed functions of the teams so that 
the speed of each team be represented by a function of prob¬ 
lem size. In the case of MPDATA, the problem size is char¬ 
acterized by the size of the domain processed by the team 
and therefore represented by three parameters n, m and l. 







In real-life NWP simulations l is fixed [26]. Therefore, we 
build speeds of teams as functions of two parameters n and 
ra, setting l to 128, the value typically used in simulations. 

In general, the speed should be measured in equal-sized com¬ 
putation units performed per one time unit [l3], for example, 
in flops. In the case of MPDATA, it is difficult to estimate 
the amount of arithmetic operations that will be executed 
during the processing of a n x m x l computation domain. 
We know however that with a very high level of accuracy 
this amount is directly proportional to the number of cells 
in this domain. Therefore, we measure the speed in cells per 
second. 

The speed functions are built empirically by benchmarking 
the work teams for a range of problem sizes. For each prob¬ 
lem size (n, m), the speed is calculated as nXm t x128 , where 
t is the measured execution time. 

It has been shown 29, [28] that in modern multicore, many- 
core and hybrid platforms, where processing elements are 
coupled and share resources, the speed of one group of el¬ 
ements may depend on the load of others due to resource 
contention. Therefore, the groups cannot be considered as 
independent processing units and their speed cannot be mea¬ 
sured separately and independently. In this work, we use the 
performance measurement method proposed in [28] . Ac¬ 
cording to this method, the performance of the four teams 
of cores is measured simultaneously rather than separately, 
thereby taking into account resource contention. To ensure 
the reliability of measurements, we repeat measurements 
multiple times. We only measure the computation time of 
every team without the overheads of inter-team synchroniza¬ 
tion required after each time step. If the measurements were 
conducted separately, the measured performance of these 
teams would not reflect their actual performance during the 
execution of the application, and therefore performance opti¬ 
mization decisions based on the corresponding performance 
models would be inaccurate. Figure [4] demonstrates the dif¬ 
ference between the speed of team To measured separately 
and simultaneously with other teams. 

Figure [5] illustrates the speed of team To as a function of 
parameters n and m (remember that l = 128). The exper¬ 
imental points for the speed function were obtained with 
steps An s= Am = 4 for both n and m parameters. 

We can see that for a fixed value of m the speed varies very 
slowly and very little with variation of n, staying nearly con¬ 
stant. More detailed analysis of the speed functions confirms 
that the speed of team strongly depends on m and very lit¬ 
tle depends on n. This observation allows us to assume that 
with a high level of accuracy the optimal (or at least a near 
optimal) partitioning of a N x M x 128 domain between 
the four teams can be obtained from the optimal even load- 
balanced partitioning, which allocates a sub-domain of size 
f x f x 128 to each team, by fixing the first parameter n 
to y and varying m. 

Mathematically, it means that we only have to deal with 
speed functions of just one parameter, m. These functions 
are obtained from the previously built speed functions of two 
parameters, n and m, by fixing the parameter n. Geometri- 
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Figure 4: Comparison of speed functions of team 
To, measured separately (aSJ 0 (x)) and simultaneously 
with other three teams (St 0 (x)) executing the same 
workload 
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Figure 5: Experimentally built speed of execution 
of the MPDATA workload by team To as function of 
two parameters n and m (l — 128) 


cally, this can be illustrated as follows. The two-parameter 
speed functions are represented by surfaces. By fixing pa¬ 
rameter n to A., we cut the surfaces by a vertical plane 
n = | as shown in Figure [H] obtaining on this plane curves, 
representing the one-parameter speed functions as shown in 
Figure [6] for n = 120. 

Finally, as all four teams have very close speed functions (as 
can be seen in Figure [6]), we calculate their average (shown 
in Figure [7]) and use it as input to Algorithm [l] to find the 
optimal value of m for each team. 

More specifically, let the MPDATA domain be of size 240 x 
240 x 128. Then, we consider our four teams as four identical 
abstract processors, p = 4, the speed of each of which is 
given by the speed function shown in Figure [7] Note that 
in this function, the amount of workload is given in frames 
of cells of size 120 x 128, while the speed is given in cells 
per second. As pointed in Section [3] despite the unit of 
workload used to measure the speed (axis y) is 120 x 128 
times greater than the unit of workload used to measure the 
size of workload (axis x), we can safely use this function as 
input to Algorthm^ 
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Figure 6: Speeds of four teams built simultaneously 
as functions of parameter m (n = 120 and l = 128) 
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Figure 7: Averaged speed of teams built as a func¬ 
tion of parameter m (m — 120 and l — 128) 


t — max( 


1418579 


1436742 


) = 1.386[s]. 


6. EXPERIMENTAL RESULTS 

In this section, we experimentally evaluate the optimization 
technique presented in Section [5] 

The performance results presented in this section are ob¬ 
tained for double precision MPDATA computations corre¬ 
sponding to 40 time steps. All the benchmarks are compiled 
as native executables using the Intel compiler (v. 15.0.2), and 
run on the Intel Xeon Phi 7120P coprocessor. To ensure the 
reliability of the results, measurements are repeated multi¬ 
ple times, and average execution times are used. We find the 
confidence interval and stop the measurements if the sam¬ 
ple mean lies in the interval with the confidence level 95%. 
We use Student’s t- test, assuming that the individual ob¬ 
servations are independent and their population follows the 
normal distribution. 


Table [I] includes both theoretical and experimental execu¬ 
tion times of MPDATA for the domain of size 240 x 240 x 128. 
These results are obtained for different configurations of 
partitioning, including the traditional ’’load-balanced” parti¬ 
tioning (A m = 0) and a range of ’’unbalanced” partitioning 
for different Am > 0. The theoretically optimal Am re¬ 
turned by Algorithm [l] is equal to 8, which corresponds to 
the configuration where each odd or even team processes the 
sub-domain of size 120 x 128 X 128 or 120 x 112 x 128 respec¬ 
tively. In this case, the estimated execution time of 1.386 
seconds is very close to the real computation time which is 
1.364 seconds. According to experiments, the shortest ex¬ 
ecution time is achieved for Am = 9, when computations 
take 1.348 seconds. 


The solution returned by Algorthm [l] allocates m — 112 
frames to even abstract processors and m — 128 frames to 
odd processors. This corresponds to partitioning of the 240 x 
240 x 128 domain into two sub-domains of size 120 x 112 x 
128, allocated to teams To and Th, and two sub-domains of 
size 120 x 128 x 128, allocated to teams Ti and X3. The 
traditional load-balanced approach partitions the domain in 
four equal sub-domains of size 120 x 120 x 128. This is 
illustrated in Figure [8] 
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Figure 8: Optimal partitioning of MPDATA of size 
240 x 240 x 128 between 4 teams 

The theoretical execution time of the even partitioning is 
1.486 seconds, while the theoretical execution time of the 
uneven partitioning returned by Algorithm^ is 1.386: 

, , X r Xl . 


Table 1: Theoretical and experimental execution 
times for MPDATA domain of size 240 x 240 x 128 
with different configurations of partitioning. The 
odd work teams process the sub-domain of size 
120 x (120 + Am) x 128, while the even teams — 
120 x (120 - Am) x 128. 


Offset 

Am 

Theoretical 
time [s] 

Experimental 
time [s] 

Speedup 

0 

1.486 

1.548 

1.000 

4 

1.470 

1.470 

1.053 

6 

1.401 

1.374 

1.127 

7 

1.422 

1.361 

1.137 

8 

1.386 

1.364 

1.135 

9 

1.398 

1.348 

1.148 

10 

1.397 

1.352 

1.145 

11 

1.429 

1.372 

1.129 

12 

1.402 

1.368 

1.131 


Comparing the experimental and theoretical times, we can 
see that the accuracy of theoretical prediction is very good, 
with prediction errors being as small as 2 — 4%. In gen¬ 
eral, we can identify two main factors contributing into the 
prediction error: 

• While the experimentally built speed functions of teams 
To, Ti, T2 and X3 are not identical, suggesting some de¬ 
gree of their heterogeneity in execution of the MPDATA 

















































workload, our theoretical model considers them homo¬ 
geneous and represents their speed by the average of 
the real speed functions, which is then used as input 
to Algorithm [l] 

• During the construction of the speed functions, the 
speed of a team for problem size nxmxl is measured 
when other teams process in parallel sub-domains of 
the same size, nxmxl. However, during the exe¬ 
cution of the application in our experiments different 
teams process sub-domains of slightly different sizes 
when Am ^ 0. 

Table ^ also demonstrates the performance gain from ap¬ 
plying the proposed load-imbalancing optimization. For the 
imbalanced configurations presented in this table, we notice 
a better performance than for the load-balanced configura¬ 
tion of the MPDATA decomposition. The largest perfor¬ 
mance gain is achieved for Am = 9, giving the speedup of 
1.148x. 

Table 2: Experimental time for all work teams with 
different partitionings: the odd work teams process 
the sub-domain of size 120 x (120 + Am) X 128, while 
the even teams —120 x (120 — Am) X 128. 


Offset 

Am 

Experimental time [s] 

Team 0 

Team 1 

Team 2 

Team 3 

Total 

0 

1.515 

1.498 

1.518 

1.503 

1.548 

4 

1.456 

1.247 

1.455 

1.249 

1.470 

6 

1.364 

1.161 

1.359 

1.162 

1.374 

7 

1.355 

1.161 

1.341 

1.168 

1.361 

8 

1.355 

1.166 

1.349 

1.172 

1.364 

9 

1.340 

1.155 

1.335 

1.161 

1.348 

10 

1.345 

1.141 

1.337 

1.152 

1.352 

11 

1.363 

1.156 

1.357 

1.154 

1.372 

12 

1.360 

1.163 

1.353 

1.165 

1.368 


Table [2] complements the results in Table [T] giving exper¬ 
imental execution times of the individual teams. We can 
clearly see a significant difference between the execution 
times measured for the odd and even teams when Am ^ 0. 
Obviously, this difference is caused by the unbalanced work¬ 
loads for the odd and even teams. However, the total execu¬ 
tion time is shorter than in the case of balanced workloads 
(Am — 0). 

Table [2] also shows that the total execution time is always 
slightly longer than the maximum time among all teams. 
It is mainly due to the fact that the computation time of 
every team is measured without the overheads of inter-team 
synchronization required after each time step. In addition, 
the results in Tableware presented in Figure [9]in a graphical 
form. 

Finally, we evaluate the proposed model-based partitioning 
algorithm for the MPDATA domain of size 480 X 480 x 128. 
As in the previous case, the application is executed for dif¬ 
ferent configurations of partitioning, for a range of Am. In 
this case, however, the theoretically optimal configuration 
returned by Algorithm [l] is exactly the same as the experi¬ 
mentally optimal one, both achieved when Am = 20. The 



Figure 9: Experimental execution times measured 
for individual work teams and the total execution 
time measured for the whole MPDATA workload 


prediction errors are also smaller in this case, not exceed¬ 
ing 3%. The experimental execution time for Am = 20 is 
5.338 seconds, in comparison with 6.140 seconds for the even 
partitioning.This allows us to accelerate the MPDATA com¬ 
putations by 1.15 times. Moreover, the performance gain is 
also observed for other unbalanced configurations, but it is 
smaller than 1.15x. The results of these experiments are 
included in Table [3] 

Table 3: Theoretical and experimental execution 
times for MPDATA domain of size 480 x 480 x 128 
with different configurations of partitioning. The 
odd work teams process the sub-domain of size 

240 x (240 + Am) x 128, while the even teams — 
240 x (240 - Am) x 128. 


Offset 

Am 

Theoretical 
time [s] 

Experimental 
time [s] 

Speedup 

0 

6.136 

6.140 

1.000 

4 

5.731 

5.681 

1.081 

8 

5.809 

5.806 

1.058 

12 

5.543 

5.453 

1.126 

16 

5.509 

5.418 

1.133 

20 

5.499 

5.338 

1.150 

24 

5.624 

5.477 

1.121 


7. CONCLUSION 

Modern compute nodes are characterized by both the in¬ 
creasing number of (possibly, heterogeneous) processing el¬ 
ements and a high level of complexity of their integration. 
Various resources such as caches and data links are shared 
in an hierarchical and non-uniform way. This makes the de¬ 
velopment of efficient applications for such platforms a very 
difficult and challenging task. It would be naive to expect 
that the performance profile of real-life scientific applica¬ 
tions on these platforms will always be comfortably nice and 
smooth to suit traditional load-balancing techniques used 
for minimization of their computation time. Therefore, new 
optimization approaches that do not rely on such increas¬ 
ingly unrealistic assumptions are needed. This work has 
presented one such approach and demonstrated its applica¬ 
bility to optimization of a real-life application on a modern 
HPC platform. 
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